The Evolution of MLLM Architectures: From Vision-Centric to Multi-Sensory Integration
AI012 · Lesson 7

The Evolution of MLLM Architectures

The evolution of multimodal large language models (MLLMs) marks a shift from modality-specific silos toward a unified representation space, in which non-text signals (images, audio, 3D) are converted into semantic forms a language model can understand.

1. From Vision to Multi-Sensory

  • Early MLLMs: focused primarily on Vision Transformers (ViT) for image-text tasks.
  • Modern architectures: integrate audio (e.g., HuBERT, Whisper) and 3D point clouds (e.g., Point-BERT) to enable truly cross-modal intelligence.

2. Projection Bridges

To connect the various modalities to the language model, a mathematical bridging mechanism is required:

  • Linear projection: the simple mapping used in early models (e.g., MiniGPT-4).
    $$X_{llm} = W \cdot X_{modality} + b$$
  • Multi-layer MLP: a two-layer approach (e.g., LLaVA-1.5) whose non-linear transformation achieves better alignment of complex features:
    $$X_{llm} = W_2 \cdot \sigma(W_1 \cdot X_{modality} + b_1) + b_2$$
  • Resamplers/abstractors: advanced modules such as the Perceiver Resampler (Flamingo) or the Q-Former, which compress high-dimensional data into a fixed number of tokens.

3. Decoding Strategies

  • Discrete tokens: represent the output as entries in a fixed vocabulary/codebook (e.g., VideoPoet).
  • Continuous embeddings: use "soft" signals to condition specialized downstream generators (e.g., NExT-GPT).
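The contrast between the two strategies can be sketched in a few lines of NumPy. This is a minimal illustration, not either system's actual decoder: the codebook size, dimensions, and function names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete tokens (VideoPoet-style): snap each output embedding to the
# nearest entry of a fixed codebook and emit its integer index.
codebook = rng.normal(size=(256, 16))  # 256 entries of dim 16 (illustrative)

def to_discrete_token(embedding):
    distances = np.linalg.norm(codebook - embedding, axis=1)
    return int(np.argmin(distances))   # a vocabulary index the LLM can emit

# Continuous embeddings (NExT-GPT-style): hand the raw "soft" vector to a
# downstream generator instead of quantizing it, so nothing is rounded away.
def to_continuous_signal(embedding):
    return embedding

llm_output = rng.normal(size=16)       # a stand-in for one LLM output vector
token_id = to_discrete_token(llm_output)
soft_signal = to_continuous_signal(llm_output)
print(token_id, soft_signal.shape)
```

The trade-off mirrors the text above: discrete tokens are easy to train with a standard language-model loss, while continuous embeddings preserve more detail for the downstream generator.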
Projection Rule
For a language model to process sound or 3D objects, the signal must be projected into the LLM's existing semantic space so that it is treated as a "modality signal" rather than noise.
Question 1
Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?
  • Token Dropping
  • Two-layer MLP or Resamplers (e.g., Q-Former)
  • Softmax Activation
  • Linear Projection
Question 2
What is the primary role of ImageBind or LanguageBind in this architecture?
  • To generate text from images
  • To compress video files
  • To create a Unified/Joint representation space for multiple modalities
  • To increase the LLM context window
Challenge: Designing an Any-to-Any System
Diagram the flow for an MLLM that takes an Audio input and generates a 3D model.
You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.
Step 1
Select the correct encoder for the input signal.
Solution:
Use an Audio Encoder such as Whisper or HuBERT to transform the raw audio waveform into feature vectors.
Step 2
Apply a Projection Layer.
Solution:
Pass the audio feature vectors through a Multi-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).
Step 3
Generate and Decode the output.
Solution:
The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
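The three steps of the challenge can be wired together as a stub pipeline. Every component below is a placeholder standing in for a real model (the encoder for Whisper/HuBERT, the decoder for a 3D diffusion model); the dimensions and pooling logic are assumptions chosen only to make the data flow visible.

```python
import numpy as np

rng = np.random.default_rng(2)
D_AUDIO, D_LLM = 8, 16  # illustrative dimensions

def audio_encoder(waveform):
    """Step 1 stub (Whisper/HuBERT): raw samples -> a feature vector.
    Real encoders emit a sequence; chunk-averaging keeps this short."""
    return waveform.reshape(D_AUDIO, -1).mean(axis=1)

W_proj = rng.normal(size=(D_LLM, D_AUDIO))

def projection_layer(features):
    """Step 2 stub: align audio features with the LLM semantic space."""
    return W_proj @ features

def llm_and_3d_decoder(aligned_tokens):
    """Step 3 stub: the LLM emits modality signals, and a 3D decoder
    turns them into geometry (here, a fake point cloud of shape (N, 3))."""
    n_points = 64
    return rng.normal(size=(n_points, 3)) + aligned_tokens[:3]

waveform = rng.normal(size=800)            # a stand-in audio clip
points = llm_and_3d_decoder(projection_layer(audio_encoder(waveform)))
print(points.shape)                        # (64, 3)
```

Swapping any stub for a real model leaves the three-stage shape of the pipeline, encode, project, decode, unchanged.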